Creation of a model to predict unit price of a car for retail segment based on the previous contracts data.
Dataset name : "Imaginary Auto Company_Sep 22.xlsx".
booking to delivery date : Delivery time post contract(medium, long, short, immediate).unit price($000) : Price per unittype of sale : States whether sale is of type wholesale or retaildate of sale : Date when contract was entered in a yyyy-mm-dd formatmode of transport : States mode of transport for delivering products(Truck, container, barge)no of units : Units sold per contractcust_group_name : Group that the customer belongs to.cust_segment_name : Customer segment (Star, prime, excellent or more)Customer Id : Unique customer iddelivery_month_from : shows when delivery of a contract startsdelivery_month_to : shows when delivery of a contract endstype : shows whether product is manual or automaticitem_class_l1 : shows approximate price range a product belongs to (basic, premium, mid-range)item_class_l2 : shows brand of productitem_class_l3 : classifies product into different buckets within a brandcolour : shows colour of productplant_nr : shows plant number at which product is producedplant_city : shows city of productionplant_country_name : shows country of production(Phillipines and Vietnam)most expensive part_l1 : shows which part of product is most expensivemost expensive part_l2 : classifies product into different buckets per most expensive part l1shipto_city : shows city to which product is shippedshipto_nr : shows customers's shipping idsoldto_city : shows city to which product is sold (note ship to city and sold to city can be different)soldto_nr : shows customers's sold to numbercar_nr : shows product's unique numbershipto_country : shows country to which product is shippedsoldto_country : shows country to which product is sold#!pip install missingno
import pandas as pd
import numpy as np
import missingno as msno
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import plotly.express as px
pd.set_option('display.max_rows',30)
pd.set_option('display.max_columns', 30)
pd.set_option('display.width', 1000)
pd.options.mode.chained_assignment = None
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
import streamlit as st
import datetime as dt
import warnings
warnings.filterwarnings('ignore')
plt.style.use('fivethirtyeight')
DATA_URL = ".\Imaginary Auto Company_Sep 22.xlsx"
def load_data():
data = pd.read_excel(DATA_URL,sheet_name = "Contracts data",engine = "openpyxl")
return data
df = load_data()
#df = pd.read_excel(".\Imaginary Auto Company_Sep 22.xlsx",sheet_name = "Contracts data",engine = "openpyxl")
df.head()
| Unnamed: 0 | booking to delivery date | unit price($000) | type of sale | date of sale | mode of transport | no of units | cust_group_name | cust_segment_name | Customer Id | delivery_month_from | delivery_month_to | type | item_class_l1 | item_class_l2 | item_class_l3 | colour | plant_nr | plant_city | plant_country_name | most expensive part_l1 | most expensive part_l2 | shipto_city | shipto_nr | soldto_city | soldto_nr | car_nr | shipto_country | soldto_country | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | long | 1662.54 | retail | 2159-09-26 | Truck | 2640.00 | aa | Prime | 953.0-952.0 | 2160-04-01 | 2161-10-31 | manual | Premium | Pagani | Pagani1 | . | 1085 | Quezon City | Philippines | bonnet | bonner | Algeria | 953 | Algeria | 952 | Pag99992258 | Algeria | Algeria |
| 1 | 4 | long | 1662.54 | retail | 2159-09-26 | Truck | 0.00 | aa | Prime | 953.0-952.0 | 2160-04-01 | 2161-10-31 | manual | Premium | Pagani | Pagani1 | . | 1085 | Quezon City | Philippines | bonnet | bonner | Algeria | 953 | Algeria | 952 | Pag99992258 | Algeria | Algeria |
| 2 | 5 | long | 1662.54 | retail | 2159-09-26 | Truck | 5720.00 | aa | Prime | 953.0-952.0 | 2160-04-01 | 2161-10-31 | manual | Premium | Pagani | Pagani1 | . | 1085 | Quezon City | Philippines | bonnet | bonner | Algeria | 953 | Algeria | 952 | Pag99992258 | Algeria | Algeria |
| 3 | 6 | long | 1896.18 | retail | 2161-02-26 | Truck | 2051.28 | aa | Prime | 953.0-952.0 | 2160-04-01 | 2161-10-31 | manual | Premium | Pagani | Pagani1 | . | 1085 | Quezon City | Philippines | bonnet | bonner | Algeria | 953 | Algeria | 952 | Pag99992258 | Algeria | Algeria |
| 4 | 7 | long | 1206.26 | retail | 2160-06-22 | Truck | 1386.00 | aa | Prime | 953.0-952.0 | 2160-04-01 | 2161-10-31 | manual | Premium | Pagani | Pagani1 | . | 1085 | Quezon City | Philippines | bonnet | bonner | Algeria | 953 | Algeria | 952 | Pag99992258 | Algeria | Algeria |
numCols = df.select_dtypes("number").columns
catCols = df.select_dtypes("object").columns
numCols= list(set(numCols))
catCols= list(set(catCols))
print("Numerical columns : ",", ".join(numCols),end = "\n")
print("Categorical columns : ",", ".join(catCols),end = " ")
Numerical columns : unit price($000), soldto_nr, Unnamed: 0, shipto_nr, no of units, plant_nr Categorical columns : mode of transport, type, soldto_city, car_nr, plant_city, shipto_country, item_class_l2, most expensive part_l2, type of sale, item_class_l1, plant_country_name, cust_group_name, colour, soldto_country, most expensive part_l1, booking to delivery date, cust_segment_name, item_class_l3, shipto_city, Customer Id
df['type of sale'].value_counts()
wholesale 59352 retail 21141 Name: type of sale, dtype: int64
retail_df = df.loc[df['type of sale'] == 'retail']
retail_df.rename(columns={'unit price($000)': 'unit_price_k','no of units':'no_units',
'booking to delivery date':'delivery_duration','most expensive part_l2': 'most_expensive_part_l2',
'type of sale':'type_of_sale','date of sale':'date_of_sale','mode of transport':'mode_of_transport',
'Customer Id':'customer_id','most expensive part_l1': 'most_expensive_part_l1'
},inplace = True)
retail_df.drop(['Unnamed: 0'],axis = 1,inplace=True)
numCols = retail_df.select_dtypes("number").columns
catCols = retail_df.select_dtypes("object").columns
numCols= list(set(numCols))
catCols= list(set(catCols))
print("Numerical columns : ",", ".join(numCols),end = "\n")
print("Categorical columns : ",", ".join(catCols),end = " ")
Numerical columns : unit_price_k, soldto_nr, shipto_nr, plant_nr, no_units Categorical columns : type, soldto_city, car_nr, plant_city, shipto_country, item_class_l2, delivery_duration, item_class_l1, plant_country_name, cust_group_name, colour, most_expensive_part_l2, soldto_country, type_of_sale, cust_segment_name, customer_id, mode_of_transport, item_class_l3, most_expensive_part_l1, shipto_city
print(f"Dataset has {retail_df.shape[0]} rows and {retail_df.shape[1]} columns")
Dataset has 21141 rows and 28 columns
retail_df.isnull().sum()
delivery_duration 0 unit_price_k 0 type_of_sale 0 date_of_sale 0 mode_of_transport 0 no_units 0 cust_group_name 0 cust_segment_name 12 customer_id 0 delivery_month_from 0 delivery_month_to 0 type 0 item_class_l1 0 item_class_l2 0 item_class_l3 0 colour 0 plant_nr 0 plant_city 1 plant_country_name 1 most_expensive_part_l1 12 most_expensive_part_l2 24 shipto_city 87 shipto_nr 0 soldto_city 0 soldto_nr 0 car_nr 0 shipto_country 0 soldto_country 0 dtype: int64
#retail_df = retail_df.fillna(retail_df.mode().iloc(0))
retail_df.isnull().sum()
delivery_duration 0 unit_price_k 0 type_of_sale 0 date_of_sale 0 mode_of_transport 0 no_units 0 cust_group_name 0 cust_segment_name 12 customer_id 0 delivery_month_from 0 delivery_month_to 0 type 0 item_class_l1 0 item_class_l2 0 item_class_l3 0 colour 0 plant_nr 0 plant_city 1 plant_country_name 1 most_expensive_part_l1 12 most_expensive_part_l2 24 shipto_city 87 shipto_nr 0 soldto_city 0 soldto_nr 0 car_nr 0 shipto_country 0 soldto_country 0 dtype: int64
no_units_sold = retail_df[retail_df['no_units'] == 0]
print("Number of units not sold",no_units_sold.shape[0])
print("Costs involved with these unsold units",round(no_units_sold['unit_price_k'].sum(),2),"$")
print("Percentage loss : ",round((no_units_sold['unit_price_k'].sum()/retail_df['unit_price_k'].sum())*100,2),"%")
Number of units not sold 2706 Costs involved with these unsold units 4499978.21 $ Percentage loss : 13.5 %
retail_df.colour.value_counts()
. 20652 Olive 105 Purple 95 Red 91 Lavender 81 Mauvre 40 Green 36 Yellow 26 Dark Purple 7 Dark Green 5 Ochre 1 White 1 Teal 1 Name: colour, dtype: int64
retail_df['colour'] = retail_df['colour'].apply(lambda x:x.replace(".","Black"))
sns.countplot(retail_df.colour)
plt.xticks(rotation = 45);
plt.figure(figsize=(20, 5))
plt.subplot(1,3,1)
sns.boxplot(x = 'mode_of_transport', y = 'unit_price_k', data = retail_df)
plt.subplot(1,3,2)
sns.boxplot(x = 'type', y = 'unit_price_k', data = retail_df)
plt.subplot(1,3,3)
sns.boxplot(x = 'plant_country_name', y = 'unit_price_k', data = retail_df)
<AxesSubplot:xlabel='plant_country_name', ylabel='unit_price_k'>
plt.figure(figsize=(20, 5))
plt.subplot(1,3,1)
sns.boxplot(x = 'mode_of_transport', y = 'no_units', data = retail_df)
plt.subplot(1,3,2)
sns.boxplot(x = 'type', y = 'no_units', data = retail_df)
plt.subplot(1,3,3)
sns.boxplot(x = 'plant_country_name', y = 'no_units', data = retail_df)
<AxesSubplot:xlabel='plant_country_name', ylabel='no_units'>
fig = plt.figure(figsize = (12,3))
plt.subplot(1,2,1)
sns.histplot(retail_df['no_units'],bins = 40 ,color='r')
plt.subplot(1,2,2)
sns.histplot(retail_df['unit_price_k'],bins = 40,color='b')
plt.subplots_adjust(wspace = 0.8)
retail_df.loc[(retail_df['unit_price_k']>50000)]
| delivery_duration | unit_price_k | type_of_sale | date_of_sale | mode_of_transport | no_units | cust_group_name | cust_segment_name | customer_id | delivery_month_from | delivery_month_to | type | item_class_l1 | item_class_l2 | item_class_l3 | colour | plant_nr | plant_city | plant_country_name | most_expensive_part_l1 | most_expensive_part_l2 | shipto_city | shipto_nr | soldto_city | soldto_nr | car_nr | shipto_country | soldto_country | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 58302 | immediate | 193971.58 | retail | 2159-03-15 | Truck | 0.22 | tc | Prime | 4190.0-4195.0 | 2159-03-15 | 2159-03-31 | manual | Premium | Aspark | Aspark7 | Black | 1085 | Quezon City | Philippines | Window regulator | glass | Kétou | 4190 | Boussé | 4195 | Asp99990000 | Benin | Burkina Faso |
px.box(retail_df['unit_price_k'])
sns.jointplot(x = retail_df['no_units'], y = retail_df['unit_price_k'], hue = retail_df["plant_country_name"])
<seaborn.axisgrid.JointGrid at 0x17cc5a96250>
retail_df = retail_df.loc[(retail_df['unit_price_k']<50000)]
retail_df['unit_price_k'].describe()
count 21140.000000 mean 1567.710181 std 1728.860511 min -16287.260000 25% 497.420000 50% 1213.700466 75% 1804.000000 max 33114.840000 Name: unit_price_k, dtype: float64
retail_df = retail_df.loc[(retail_df['unit_price_k']<50000)]
px.box(retail_df['unit_price_k'])
sns.displot(retail_df['unit_price_k'], bins = 60)
<seaborn.axisgrid.FacetGrid at 0x17cc4424340>
sns.displot(retail_df['no_units'], bins = 60)
<seaborn.axisgrid.FacetGrid at 0x17cc3ff87f0>
sns.relplot(x = retail_df['no_units'], y = retail_df['unit_price_k'],style = retail_df['plant_country_name'],hue = retail_df['mode_of_transport'])
plt.xticks(rotation = 45);
#outl_df = retail_df.loc[(retail_df['unit price($000)']<0)]
print(f"Total unit price for outliers detected $",round(retail_df['unit_price_k'].sum(),2))
print("Number of units detected in the outliers ", retail_df['no_units'].sum())
Total unit price for outliers detected $ 33141393.22 Number of units detected in the outliers 85015197.454
fig,ax = plt.subplots(4,1,figsize = (7,25))
ax1,ax2,ax3,ax4 = ax.flatten()
# cnt = retail_df.groupby(['colour']).sum()['unit_price_k'].sort_values(ascending = False).to_frame()[:20]
# sns.barplot(x = cnt['unit_price_k'],y = cnt.index,ax = ax1 ,palette= 'winter')
# ax1.set_title('Price distribution according to colour of the car')
cnt = retail_df.groupby(['shipto_country']).sum()['unit_price_k'].sort_values(ascending = False).to_frame()[:20]
sns.barplot(x = cnt['unit_price_k'],y = cnt.index,ax = ax1 ,palette= 'inferno')
ax1.set_title('Countries with shipped price distribution')
cnt = retail_df.groupby(['most_expensive_part_l1']).sum()['unit_price_k'].sort_values(ascending = False).to_frame()[:20]
sns.barplot(x = cnt['unit_price_k'],y = cnt.index,ax = ax2 ,palette= 'viridis')
ax2.set_title('Expensive parts for cars')
cnt = retail_df.groupby(['soldto_country']).sum()['unit_price_k'].sort_values(ascending = False).to_frame()[:20]
sns.barplot(x = cnt['unit_price_k'],y = cnt.index,ax = ax3 ,palette= 'summer')
ax3.set_title('Maximum sales per country')
cnt = retail_df.groupby(['plant_city']).sum()['unit_price_k'].sort_values(ascending = False).to_frame()[:20]
sns.barplot(x = cnt['unit_price_k'],y = cnt.index,ax = ax4 ,palette= 'autumn')
ax4.set_title('Plant city that manufactures the high profit sales')
plt.subplots_adjust(wspace = 0.8,hspace=0.3)
fig,ax = plt.subplots(4,1,figsize = (7,25))
ax1,ax2,ax3,ax4 = ax.flatten()
# cnt = retail_df.groupby(['colour']).sum()['no_units'].sort_values(ascending = False).to_frame()[:10]
# sns.barplot(x = cnt['no_units'],y = cnt.index,ax = ax1 ,palette= 'winter')
# ax1.set_title('Number of units sold according to colour of the car')
cnt = retail_df.groupby(['shipto_country']).sum()['no_units'].sort_values(ascending = False).to_frame()[:15]
sns.barplot(x = cnt['no_units'],y = cnt.index,ax = ax1 ,palette= 'inferno')
ax1.set_title('Number of units shipped per coutry')
cnt = retail_df.groupby(['most_expensive_part_l1']).sum()['no_units'].sort_values(ascending = False).to_frame()[:20]
sns.barplot(x = cnt['no_units'],y = cnt.index,ax = ax2 ,palette= 'viridis')
ax2.set_title('Units sold with most desired expensive parts')
cnt = retail_df.groupby(['soldto_country']).sum()['no_units'].sort_values(ascending = False).to_frame()[:10]
sns.barplot(x = cnt['no_units'],y = cnt.index,ax = ax3 ,palette= 'summer')
ax3.set_title('Maximum units sold per country')
cnt = retail_df.groupby(['plant_city']).sum()['no_units'].sort_values(ascending = False).to_frame()[:10]
sns.barplot(x = cnt['no_units'],y = cnt.index,ax = ax4 ,palette= 'autumn')
ax4.set_title('City that manufactures the highest number of units')
plt.subplots_adjust(wspace = 0.8,hspace=0.3)
plt.figure(figsize = (8,5))
price_item_class_l2 = retail_df.groupby(['cust_group_name']).sum()['unit_price_k'].sort_values(ascending = False).to_frame()[:20]
sns.barplot(x = 'unit_price_k', y = price_item_class_l2.index,data = price_item_class_l2,palette='spring')
plt.title('')
plt.xlabel('unit_price_k',fontsize = 15)
plt.ylabel('Customer group name',fontsize = 10,rotation = 'vertical')
Text(0, 0.5, 'Customer group name')
plt.figure(figsize = (8,3))
price_item_class_l2 = retail_df.groupby(['cust_segment_name']).sum()['unit_price_k'].sort_values(ascending = False).to_frame()[:20]
sns.barplot(x = 'unit_price_k', y = price_item_class_l2.index,data = price_item_class_l2,palette='spring')
plt.title('Revenue per customer group',fontsize = 15);
plt.xlabel('unit price($000)',fontsize = 15)
plt.ylabel('Customer group name',fontsize = 15,rotation = 'vertical');
plt.figure(figsize = (12,5))
cnt = retail_df.groupby(['item_class_l2']).mean()['unit_price_k'].sort_values(ascending = False).to_frame()[:50]
sns.barplot(x = cnt['unit_price_k'],y = cnt.index,palette= 'winter')
plt.title("Revenue per brand")
plt.xticks(rotation = 45);
plt.figure(figsize = (12,8))
cnt = retail_df.groupby(['item_class_l3']).mean()['unit_price_k'].sort_values(ascending = False).to_frame()[:50]
sns.barplot(x = cnt['unit_price_k'],y = cnt.index,palette= 'twilight_shifted')
ax2.set_title('Revenue per model')
#plt.yticks(rotation = 45)
plt.subplots_adjust(wspace = 0.8)
retail_df_neg = retail_df.loc[retail_df['unit_price_k'] > 0]
sales_df = retail_df_neg.groupby(['item_class_l3','cust_segment_name']).agg({'unit_price_k':'mean','no_units':'mean'}).\
sort_values(by = "no_units",ascending=False)
sales_df.reset_index(inplace = True)
sales_df
| item_class_l3 | cust_segment_name | unit_price_k | no_units | |
|---|---|---|---|---|
| 0 | Bentley | Excellent | 521.335513 | 14043.668618 |
| 1 | Pagani1 | More | 1176.169780 | 11380.355890 |
| 2 | Ferrari | Prime | 1397.517853 | 9091.578461 |
| 3 | Aspark11 | Excellent | 1454.371525 | 8874.576271 |
| 4 | Bentley | Star | 539.645315 | 7718.549747 |
| ... | ... | ... | ... | ... |
| 93 | Pagani2 | Excellent | 6746.520000 | 28.233333 |
| 94 | Rolls-Royce2 | Prime | 1980.000000 | 22.000000 |
| 95 | Rolls-Royce4 | Prime | 704.000000 | 22.000000 |
| 96 | Buggati | Excellent | 4400.000000 | 22.000000 |
| 97 | Aston Martin4 | Star | 3740.000000 | 4.400000 |
98 rows × 4 columns
px.scatter(data_frame = sales_df,x = 'unit_price_k',y = 'no_units',hover_data=['cust_segment_name'],color = 'item_class_l3')
sales_grp_cust_df = retail_df_neg.groupby(['item_class_l3','cust_group_name']).agg({'unit_price_k':'sum','no_units':'sum'}).\
sort_values(by = "unit_price_k",ascending=False)
sales_grp_cust_df.reset_index(inplace = True)
px.scatter(data_frame = sales_grp_cust_df,x = 'unit_price_k',y = 'no_units',hover_data=['cust_group_name'],color = 'item_class_l3')
sales_l2_df = retail_df_neg.groupby(['item_class_l2']).agg({'unit_price_k':'sum','no_units':'sum'}).\
sort_values(by = "no_units",ascending=False)
sales_l2_df.reset_index(inplace = True)
sales_l2_df
| item_class_l2 | unit_price_k | no_units | |
|---|---|---|---|
| 0 | Bentley | 1.626001e+06 | 2.236847e+07 |
| 1 | Lotus Cars | 3.809138e+06 | 1.621270e+07 |
| 2 | Ferrari | 3.228916e+06 | 1.399056e+07 |
| 3 | Aspark | 4.503512e+06 | 6.433299e+06 |
| 4 | Pagani | 5.386945e+06 | 5.726159e+06 |
| 5 | W Motors | 3.304286e+06 | 4.008274e+06 |
| 6 | McLaren Automotive | 1.143466e+06 | 3.226757e+06 |
| 7 | Rolls-Royce | 4.705128e+06 | 2.595787e+06 |
| 8 | Aston Martin | 1.579900e+06 | 1.389890e+06 |
| 9 | Bugatti | 2.310537e+06 | 9.569140e+05 |
| 10 | Mercedes-Benz | 1.054640e+06 | 5.648836e+05 |
| 11 | Lamborghini | 3.967731e+05 | 3.513853e+05 |
| 12 | Rimac Automobili | 2.694167e+05 | 2.455898e+05 |
| 13 | Koenigsegg | 2.211000e+03 | 2.200000e+03 |
px.scatter(data_frame = sales_l2_df,x = 'unit_price_k',y = 'no_units',hover_data=['item_class_l2'],color = 'item_class_l2',)
retail_df_neg.shape
(19432, 28)
high_sales_brand_df = retail_df_neg.loc[(retail_df_neg['item_class_l2'] == 'Bentley') | (retail_df_neg['item_class_l2'] == 'Lotus Cars') \
| (retail_df_neg['item_class_l2'] == 'Ferrari') | (retail_df_neg['item_class_l2'] == 'Aspark')]
high_sales_brand_df.reset_index(inplace = True)
high_sales_brand_df.groupby('item_class_l2').agg({'unit_price_k':'mean'}).sort_values(by = "unit_price_k",ascending=False)
| unit_price_k | |
|---|---|
| item_class_l2 | |
| Aspark | 2055.459627 |
| Ferrari | 1496.253948 |
| Lotus Cars | 998.725128 |
| Bentley | 539.840945 |
catCols
for column in catCols:
retail_df[column] = pd.factorize(retail_df[column])[0]
retail_df = retail_df.drop_duplicates()
def date_features(retail_df):
retail_df["date_of_sale"] = pd.to_datetime(retail_df["date_of_sale"])
retail_df["day"] = retail_df.date.dt.day
retail_df["month"] = retail_df.date.dt.month
retail_df["week_day"] = retail_df.date.dt.weekday
retail_df.drop(columns="date", inplace=True)
return df
def recent_transaction_features(retail_df):
retail_df["lag_t28"] = retail_df["unit_price_k"].transform(lambda x: x.shift(28))
retail_df["rolling_mean_t7"] = retail_df["unit_price_k"].transform(lambda x:x.shift(28).rolling(7).mean())
retail_df['rolling_mean_t30'] = retail_df['unit_price_k'].transform(lambda x: x.shift(28).rolling(30).mean())
retail_df['rolling_mean_t60'] = retail_df['unit_price_k'].transform(lambda x: x.shift(28).rolling(60).mean())
retail_df['rolling_mean_t90'] = retail_df['unit_price_k'].transform(lambda x: x.shift(28).rolling(90).mean())
retail_df['rolling_mean_t180'] = retail_df['unit_price_k'].transform(lambda x: x.shift(28).rolling(180).mean())
retail_df.fillna(0, inplace=True)
return retail_df
retail_df = recent_transaction_features(retail_df)
#correlation matrix
corrmat = retail_df.corr()
f, ax = plt.subplots(figsize=(15, 15))
sns.heatmap(corrmat, vmax=.8,square=False);
retail_df.to_csv("train.csv")
retail_df['date_of_sale'].max()
Timestamp('2161-05-21 00:00:00')
now = dt.datetime(2161,5,22)
rfm = retail_df.groupby('customer_id').agg({'date_of_sale' : lambda day : (now - day.max()).days,
'customer_id': lambda num : len(num),
'unit_price_k': lambda price : price.sum()
})
col_list = ['Recency','Frequency','Monetary']
rfm.columns = col_list
rfm["R"] = pd.qcut(rfm["Recency"],5,labels=[5,4,3,2,1])
rfm["F"] = pd.qcut(rfm["Frequency"],5,labels=[1,2,3,4,5])
rfm["M"] = pd.qcut(rfm["Monetary"],5,labels=[1,2,3,4,5])
rfm["RFM_Score"] = rfm["R"].astype(int) +rfm["F"].astype(int) + rfm["M"].astype(int)
seg_map = {
r'[1-3][1-2][1-3]': 'Low Engagement & Low Value ',
r'[1-3][2-5][1-5]': 'Old Activity & High Frequency',
r'[4-5][1-3][1-5]': 'Recent Activity & Low Frequency',
r'[1-5][3-4][1-3]': 'High Engagement & Low Value',
r'[4-5][4-5][3-5]': 'High engagement and high value'
}
rfm['Segment'] = rfm['R'].astype(str) + rfm['F'].astype(str) + rfm['M'].astype(str)
rfm['Segment'] = rfm['Segment'].replace(seg_map, regex=True)
rfm.head()
| Recency | Frequency | Monetary | R | F | M | RFM_Score | Segment | |
|---|---|---|---|---|---|---|---|---|
| customer_id | ||||||||
| 0 | 52 | 31 | 49956.940000 | 3 | 4 | 4 | 11 | Old Activity & High Frequency |
| 1 | 5 | 50 | 84452.534171 | 5 | 4 | 4 | 13 | High engagement and high value |
| 2 | 193 | 5 | 2497.660000 | 2 | 2 | 1 | 5 | Low Engagement & Low Value |
| 3 | 774 | 1 | 1540.000000 | 1 | 1 | 1 | 3 | Low Engagement & Low Value |
| 4 | 960 | 1 | 1870.000000 | 1 | 1 | 1 | 3 | Low Engagement & Low Value |
rfm.groupby('Segment').mean().sort_values('Monetary',ascending=False)
| Recency | Frequency | Monetary | RFM_Score | |
|---|---|---|---|---|
| Segment | ||||
| High engagement and high value | 6.622642 | 209.188679 | 296625.338121 | 13.716981 |
| Old Activity & High Frequency | 208.708333 | 38.229167 | 73879.013049 | 9.375000 |
| Recent Activity & Low Frequency | 10.032258 | 8.483871 | 23777.369975 | 8.967742 |
| High Engagement & Low Value | 6.000000 | 43.400000 | 23714.504000 | 11.400000 |
| Low Engagement & Low Value | 409.376812 | 2.550725 | 6430.792724 | 4.956522 |
rfm.groupby('Segment').mean().sort_values('Recency',ascending=True)
| Recency | Frequency | Monetary | RFM_Score | |
|---|---|---|---|---|
| Segment | ||||
| High Engagement & Low Value | 6.000000 | 43.400000 | 23714.504000 | 11.400000 |
| High engagement and high value | 6.622642 | 209.188679 | 296625.338121 | 13.716981 |
| Recent Activity & Low Frequency | 10.032258 | 8.483871 | 23777.369975 | 8.967742 |
| Old Activity & High Frequency | 208.708333 | 38.229167 | 73879.013049 | 9.375000 |
| Low Engagement & Low Value | 409.376812 | 2.550725 | 6430.792724 | 4.956522 |